Abstract: Co-occurrence data is a common and important
information source in many areas, such as word co-occurrence
in sentences, friend co-occurrence in social networks, and
product co-occurrence in commercial transaction data. It
contains rich correlation and clustering information about
the items. In this paper, we study co-occurrence data using a
general energy-based probabilistic model, and we analyze three
categories of energy-based models, namely the $L_1$, $L_2$
and $L_k$ models, which capture different levels of
dependency in the co-occurrence data. We also discuss how
several typical existing models relate to these three types of
energy models, including the Fully Visible Boltzmann Machine
(FVBM) ($L_2$), Matrix Factorization ($L_2$), Log-BiLinear (LBL)
models ($L_2$), and the Restricted Boltzmann Machine (RBM)
model ($L_k$). We then propose a Deep Embedding Model (DEM)
(an $L_k$ model), derived from the energy model in a principled
manner. Furthermore, motivated by the observation that the
partition function in the energy model is intractable and by the
fact that the major objective of modeling co-occurrence data is
to predict using the conditional probability, we apply the
maximum pseudo-likelihood method to learn DEM. As a consequence,
the developed model and its learning method naturally avoid the
above difficulties, and the conditional probability needed for
prediction is easy to compute. Interestingly, our method is
equivalent to learning a specially structured deep neural network
using back-propagation and a special sampling strategy, which
makes it scalable to large-scale datasets. Finally, in the
experiments, we show that DEM achieves comparable or better
results than state-of-the-art methods on datasets across several
application domains.

I. Introduction

Co-occurrence data is an important and common data
signal in many scenarios, for example, people co-occurrence
in social networks, word co-occurrence in sentences, and product
co-occurrence in transaction data. By indicating which
items appear together in each data sample, it provides rich
information about the underlying correlation between different
items, from which useful information can be extracted. Several
well-known machine learning models have been developed
for analyzing co-occurrence data, e.g., topic models for
bags-of-words, the Restricted Boltzmann Machine [22], and Matrix
Factorization methods for collaborative filtering. These
statistical models are designed to discover the implicit or
explicit hidden structure in the co-occurrence data, and the
latent structures can be used for domain-specific tasks.

In this paper, we study unsupervised learning over
general co-occurrence data, in particular learning the
probability distribution of the input data, which is a fundamental
problem in statistics. One of the main objectives of learning
a probabilistic model from co-occurrence data is to predict
potentially missing items from existing items, which can be
formulated as computing a conditional probability distribution.
We focus on energy-based probabilistic models,
and develop a deep energy model with high capacity and
an efficient learning algorithm for modeling co-occurrence
data. Before that, we systematically analyze the ability of
the energy model to capture different levels of dependency
in the co-occurrence data, and we identify three
categories of energy models, namely Level 1 ($L_1$), Level 2
($L_2$) and Level k ($L_k$) models. The $L_1$ models consider the
components of the input vectors (i.e., the items) to be independent
of each other, so that the joint occurrence probability of the items
is completely characterized by the popularity of each
item. The $L_2$ models assume that the items occurring in the data
are pairwise dependent. Typical $L_2$ models are the Ising
model and the Fully Visible Boltzmann Machine (FVBM).
A model based on the $L_k$ assumption is capable of
capturing any high-order (up to $k$) dependency among items.
The Restricted Boltzmann Machine (RBM) is an example of an $L_k$
model^1. However, the RBM remains a shallow model whose
capacity is restricted by the number of hidden units.
Furthermore, we also study several existing latent embedding models
for co-occurrence data, in particular the Log-BiLinear (LBL) word
embedding model and the matrix factorization based linear
embedding model. Both can be interpreted as
Bayesian $L_2$ models, which are closely related to the FVBM
model. Motivated by this observation, we propose a Deep
Embedding Model (DEM) with an efficient learning algorithm,
derived from energy-based probabilistic models in a principled
manner, for mining co-occurrence data. DEM is a bottom-up
hierarchical energy-based model, which incorporates both
low-order and high-order item-correlation features
within a unified framework. With such a deep hierarchical
representation of the input data, it is able to capture rich
dependency information in the co-occurrence data. During the
development of the model and its training algorithm, we make
several important observations. First, due to the intractability
of the partition function in energy models, we avoid the
traditional maximum-likelihood method for learning our deep
embedding model. Second, since our objective in modeling
the co-occurrence data is to predict potentially missing items
from existing items, the conditional probability distribution is
the point of interest after learning. With these observations, we
show that the conditional probability distribution is in fact
independent of the partition function, and is determined by
the dynamic energy function, which is easy to compute.
Moreover, this observation naturally leads us to
use the maximum pseudo-likelihood method to learn the
deep model. Interestingly, we find that the maximum
pseudo-likelihood method for learning DEM is equivalent to training
a deep neural network (DNN) using (i) back-propagation, and
(ii) a special sampling strategy that artificially generates the
supervision signal from the co-occurrence data. The equivalent
DNN has sigmoid units in all of its hidden layers and in the
output layer, and its output layer is fully connected
to all of its hidden layers. Therefore, the training algorithm is
a discriminative method, which is efficient and scalable on
large-scale datasets. Finally, in experiments, we show that
DEM achieves comparable or significantly better results than
state-of-the-art methods on datasets across different domains.

^1 The RBM model is proved to be a universal approximator if the size
of hidden states is exponential in the input dimension.

Paper Organization: In Section II, we provide a brief
review of the related work on statistical models for co-occurrence
data. In Section III, we formally describe the Bayesian
dependence framework for learning the high-dimensional binary data
distribution. In Section IV, we introduce the deep embedding
model and the pseudo-likelihood principle for model parameter
estimation. In Section V, we report the detailed experimental
results, and we conclude the paper in Section VI.

II. Related Work 

Several models have been proposed in the literature for
estimating the distribution of binary data. The Bayesian
mixture model is the most common one; it assumes that
the binary data are generated from multivariate Bernoulli
distributions. It has also been argued that better performance can
be achieved by modeling the conditional probability of items
with log-linear logistic regressors; the resulting model is
named the fully visible sigmoid belief network. The RBM,
proposed in [9], is a universal approximator for arbitrary data
distributions. It is shown in [14] that tractable RBMs can
outperform standard mixture models. Recently, the Neural
Autoregressive Distribution Estimator (NADE) was proposed.
Experiments demonstrate that NADE achieves
significant improvements over RBMs on density estimation
problems. However, a limitation of NADE is that it requires
a priori knowledge of the dependence order of the
variables. Although NADE achieves promising results for
modeling the data distribution, it is intractable for estimating the
conditional probability of variables. Furthermore, a multi-layer
neural network method has been proposed for data density
estimation, but its high model complexity (the number of free
model parameters is $O(HN^2)$, where $H$ is the number of
hidden neurons and $N$ is the dimension of the input data) restricts
its application in practice.

Dimension reduction and matrix factorization
are two common types of embedding techniques.
However, both approaches focus on learning
low-dimensional representations of objects while preserving their
pairwise distances. Co-occurrence data, in contrast, may exhibit
high-order dependence (discussed in the following
section). Therefore, instead of preserving the distance between one
object and another, the Deep Embedding Model captures
high-order dependence, i.e., the correlation between multiple
objects and another one.

The Deep Embedding Model (DEM) proposed in this paper 
is derived as a model for estimating the data distribution. 
We evaluate the performance of DEM on the missing item 
prediction task, which will show that the proposed DEM 
significantly outperforms most of the existing models. 

DEM is also closely related to autoencoder models,
which contain two components: an encoder and a decoder. The
encoder maps the input data to hidden states, while the decoder
reconstructs the input data from the hidden states. There
are also studies that connect the denoising autoencoder to
generative learning. Indeed, DEM can
also be viewed as a special case of the denoising autoencoder.
In the encoding phase, the input data is corrupted by randomly
dropping one element, and is then fed into the encoder function
to generate the hierarchical latent embedding vectors. In
the decoding phase, the missing items are reconstructed from
the latent vectors.

III. Co-occurrence Data Modeling

In this section, we first introduce the basic notation of the 
paper and then present the Bayesian dependency framework 
for analyzing the existing models. 

Let $V$ denote the set of co-occurrence data, which
contains $N$-dimensional binary vectors $v \in \{0,1\}^N$, where $N$
is the total number of items. Specifically, the value of the $n$-th
entry of the vector $v$ is equal to one if the corresponding item
occurs, and it is equal to zero if it does not occur. For example,
in word co-occurrence data, $N$ denotes the vocabulary size,
and the values of the entries in the vector $v$ denote whether the
corresponding words appear in the current sentence.
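As a concrete illustration of this representation, the following minimal Python sketch converts a set of item indices into the binary vector $v$ and recovers the item set from it. The function names and the toy size of five items are assumptions made for illustration; they are not part of the original model description.

```python
def to_binary_vector(items, num_items):
    """Encode a collection of item indices as an N-dimensional 0/1 list."""
    v = [0] * num_items
    for i in items:
        v[i] = 1
    return v


def item_set(v):
    """Recover the set of occurring items (the entries of v equal to one)."""
    return {i for i, x in enumerate(v) if x == 1}


# Toy example with N = 5 items, where items 0 and 3 co-occur in one sample.
v = to_binary_vector([0, 3], num_items=5)
```

In practice such vectors are very sparse, so storing only the item set is usually preferable to storing the dense vector.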

The fundamental statistical problem for co-occurrence
learning can be formulated as estimating the probability mass
function (pmf), $p_\theta(v)$, $v \in \{0,1\}^N$, from the observation
dataset $V$. A straightforward method for pmf estimation is
to count the frequency of occurrence of each $v$ in the entire
corpus $V$, assuming $V$ contains infinitely many i.i.d. samples.
However, this is unrealistic in practice because it requires us to
learn a huge table of $2^N$ entries, where $N$ can be as large as tens
of thousands in many applications. Therefore, a practically
feasible method should balance model complexity and
capacity for co-occurrence data modeling. Throughout the
paper, we consider probability mass functions $p_\theta(v)$ that
can be expressed in the following general parametric form:

$$ p_\theta(v) = \frac{1}{Z} e^{-E_\theta(v)}, \quad v \in \{0,1\}^N \tag{1} $$


where $E_\theta(v)$ is the energy function on data $v$ with parameter
$\theta$, and $Z$ is the partition function, itself a function of
$\theta$, that normalizes $p_\theta(v)$ so that it sums to one. In the
following subsections, we introduce three Bayesian
dependence assumptions, namely $L_1$, $L_2$ and $L_k$, on the model (1),
where the energy function $E_\theta(v)$ assumes different forms
under different assumptions. Within this framework, we will
show that several popular statistical models fall into different
categories (special cases) of the above framework, and we
will also explain how different types of models
trade model capacity against model complexity. Moreover, the
Bayesian dependence framework further motivates us
to develop a deep embedding model for modeling
co-occurrence data, which will be discussed in Section IV.

A. Bayesian $L_1$ Dependence Assumption

We first consider the $L_1$ Bayesian dependence assumption,
where the items in co-occurrence data are assumed to be
independent of each other so that the probability mass function
of $v$ can be factorized into the following product form:

$$ p_\theta(v) = \prod_{i \in I_v} p(i) \prod_{i \notin I_v} (1 - p(i)) \tag{2} $$

where $I_v$ denotes the set of items that occur in $v$, and $p(i)$
is the occurrence probability of the $i$-th item. Note that, in
this case, the joint probability mass function $p_\theta(v)$ is factored
into the product of the marginal probabilities of the entries of
the vector $v$. The pmf in (2) can be rewritten in the
parametric form (1), with the energy function in this case being

$$ E_\theta^{L_1}(v) = b^\top v = \sum_{i \in I_v} b_i $$

where $b_i = -\ln p(i) + \ln(1 - p(i))$ is the negative
log-likelihood ratio for the $i$-th item.
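The equivalence between the product form (2) and the energy form can be checked numerically for a small $N$. The sketch below is illustrative only: the occurrence probabilities in `p` are made-up values, and the partition function is obtained by brute-force enumeration of all $2^N$ states, which is feasible only at toy sizes.

```python
import itertools
import math


def product_pmf(v, p):
    """Independent-item pmf of (2): product of p(i) or 1 - p(i) per entry."""
    prob = 1.0
    for vi, pi in zip(v, p):
        prob *= pi if vi == 1 else (1.0 - pi)
    return prob


def energy_l1(v, b):
    """L1 energy: sum of b_i over the items that occur in v."""
    return sum(bi for vi, bi in zip(v, b) if vi == 1)


def energy_pmf(v, b):
    """pmf exp(-E(v)) / Z with Z computed by enumerating all 2^N states."""
    n = len(b)
    z = sum(math.exp(-energy_l1(u, b))
            for u in itertools.product([0, 1], repeat=n))
    return math.exp(-energy_l1(v, b)) / z


# Hypothetical occurrence probabilities for N = 3 items.
p = [0.2, 0.5, 0.9]
# b_i is the negative log-likelihood ratio of item i.
b = [-math.log(pi) + math.log(1.0 - pi) for pi in p]
```

For every binary vector `v` of length three, `product_pmf(v, p)` and `energy_pmf(v, b)` agree up to floating-point error, because here $Z = \prod_i 1/(1 - p(i))$.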

B. Bayesian $L_2$ Dependence Assumption

Likewise, for the Bayesian $L_2$ dependence, the energy
function $E_\theta(v)$ in (1) assumes the following form:

$$ E_\theta^{L_2}(v) = \tfrac{1}{2} v^\top W v + b^\top v \tag{3} $$

where $W$ is an $N \times N$ symmetric matrix with zero diagonal
entries. The energy function (3) can also be written in the
following equivalent form:

$$ E_\theta^{L_2}(v) = \sum_{i \in I_v} b_i + \sum_{i<j;\; i,j \in I_v} W_{ij} \tag{4} $$

One typical model with the $L_2$ assumption is the Markov Random
Field model (also known as the Fully Visible Boltzmann Machine
(FVBM) or the Ising model), which is widely used
in image modeling.

C. Bayesian $L_k$ Dependence Assumption

The Bayesian $L_k$ dependence assumption is proposed
to model any high-order correlations among items in
co-occurrence samples. Thus, we extend the classical $L_2$ FVBM
model to an $L_k$ FVBM. The new energy function for the $L_k$ FVBM
is given as follows:

$$ E_\theta^{L_k}(v) = \sum_{i \in I_v} b_i + \sum_{i_1<i_2;\; i_1,i_2 \in I_v} W_{i_1 i_2} + \cdots + \sum_{i_1<\cdots<i_k;\; i_1,\ldots,i_k \in I_v} W_{i_1 \cdots i_k} \tag{5} $$

Note that, as $k$ increases, the above energy function is able
to capture high-order correlation structures, and the model
complexity also grows exponentially with $k$.

D. Conditional Probability Estimation

So far we have introduced the energy-based probabilistic
model for co-occurrence data and its particular forms for
modeling different levels of dependency, i.e., the $L_1$, $L_2$ and
$L_k$ models. The classical approach for learning the
parameters of such an energy-based model is the maximum
likelihood (ML) method. However, the major challenge of
using the ML-based method is the difficulty of evaluating
the partition function $Z$ and its gradient (as a function of
$\theta$) in the energy model (1). Nevertheless, in many practical
problems, the purpose of learning the probability distribution
of the input (co-occurrence) data is to predict a potentially
missing item given a set of existing items. That is, the underlying
problem is to find the probability of certain elements of the
vector $v$ given the other elements of $v$. For example, in the
item recommendation task, the objective is to recommend
new items that a customer may potentially purchase given
the purchasing history of the customer. In these problems, the
conditional probability of the potentially missing items given
the existing items is the major point of interest. As we
proceed to show, learning an energy model that is satisfactory
for prediction using its associated conditional probability does
not require estimating the partition function in (1). In fact,
it is convenient to compute the
conditional probability from the energy model (1). Specifically,
the conditional probability can be computed from the energy
function via the following steps:

$$ \begin{aligned}
\ln p_\theta(v_t = 1 \mid v_{(-t)})
&= \ln \frac{p_\theta(v_t = 1, v_{(-t)})}{p_\theta(v_t = 1, v_{(-t)}) + p_\theta(v_t = 0, v_{(-t)})} \\
&= \ln \frac{p_\theta(v_{(+t)})}{p_\theta(v_{(+t)}) + p_\theta(v_{(-t)})} \\
&= \ln \frac{e^{-E_\theta(v_{(+t)})}}{e^{-E_\theta(v_{(+t)})} + e^{-E_\theta(v_{(-t)})}} \\
&= \ln \frac{1}{1 + \exp\{E_\theta(v_{(+t)}) - E_\theta(v_{(-t)})\}} \\
&= \ln \sigma\big(E_\theta(v_{(-t)}) - E_\theta(v_{(+t)})\big)
\end{aligned} \tag{6} $$

where $v_{(-t)} \in \{0,1\}^N$ is the input vector indicating the existing
items (with the $t$-th entry being zero), $v_{(+t)} \in \{0,1\}^N$ is an
$N$-dimensional vector with the $t$-th entry being one and all other
entries equal to those of $v_{(-t)}$, and $\sigma(\cdot)$ is the logistic
function defined as $\sigma(x) = 1/(1 + e^{-x})$. Since $v_{(+t)}$ differs
from $v_{(-t)}$ in only one bit, we define the dynamic energy function^2 as

$$ F_\theta(t, v) = E_\theta(v_{(+t)}) - E_\theta(v) \tag{7} $$

where we let $v$ denote $v_{(-t)}$ for notational simplicity. As a
result, the log conditional probability can be written as

$$ \ln p_\theta(v_t = 1 \mid v_{(-t)}) = \ln \sigma(-F_\theta(t, v)) \tag{8} $$

Note from (8) that the conditional probability $p_\theta(v_t = 1 \mid v_{(-t)})$,
which is of interest in practice, no longer depends on the
partition function, but only on the dynamic energy function
$F_\theta(t, v)$. Therefore, from now on, we only need to study the
specific form of $F_\theta(t, v)$ for the different $L_1$, $L_2$ and $L_k$
models, which can be computed as

$$ F_\theta^{L_1}(t, v) = b_t \tag{9} $$

$$ F_\theta^{L_2}(t, v) = b_t + \sum_{i \in I_v} W_{it} \tag{10} $$

$$ F_\theta^{L_k}(t, v) = b_t + \sum_{i \in I_v} W_{it} + \cdots + \sum_{i_1<\cdots<i_{k-1};\; i_1,\ldots,i_{k-1} \in I_v} W_{i_1 \cdots i_{k-1} t} \tag{11} $$

where $F_\theta^{L_1}(t, v)$ is a constant function for any given $t$;
$F_\theta^{L_2}(t, v)$ is a linear function; and $F_\theta^{L_k}(t, v)$ is a
nonlinear function of the variable $v$.

^2 Jascha Sohl-Dickstein et al. first introduced the concept of dynamic energy
in the minimum probability flow method.
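The key point of this subsection, that the conditional probability needs only two energy evaluations and never the partition function, can be illustrated with a small pairwise model. The sketch below is a toy check under stated assumptions: the parameters `b` and `w` are arbitrary made-up values, the energy sums each unordered pair once, and the brute-force reference enumerates all $2^N$ states, which is possible only for very small $N$.

```python
import itertools
import math


def energy_l2(v, b, w):
    """Pairwise energy: sum_i b_i v_i + sum_{i<j} W_ij v_i v_j."""
    n = len(v)
    e = sum(b[i] for i in range(n) if v[i])
    e += sum(w[i][j] for i in range(n) for j in range(i + 1, n)
             if v[i] and v[j])
    return e


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def cond_prob_energy(v, t, b, w):
    """p(v_t = 1 | other entries) from two energy evaluations, as in (6)."""
    v_minus = list(v)
    v_minus[t] = 0
    v_plus = list(v)
    v_plus[t] = 1
    return sigmoid(energy_l2(v_minus, b, w) - energy_l2(v_plus, b, w))


def cond_prob_bruteforce(v, t, b, w):
    """Same quantity from the normalized joint pmf (requires Z)."""
    n = len(v)
    z = sum(math.exp(-energy_l2(u, b, w))
            for u in itertools.product([0, 1], repeat=n))
    v_minus = list(v)
    v_minus[t] = 0
    v_plus = list(v)
    v_plus[t] = 1
    p_plus = math.exp(-energy_l2(v_plus, b, w)) / z
    p_minus = math.exp(-energy_l2(v_minus, b, w)) / z
    return p_plus / (p_plus + p_minus)


# Hypothetical parameters for N = 4 items (W symmetric, zero diagonal).
b = [0.3, -0.7, 1.1, 0.2]
w = [[0.0, 0.5, -0.4, 0.0],
     [0.5, 0.0, 0.9, -1.2],
     [-0.4, 0.9, 0.0, 0.3],
     [0.0, -1.2, 0.3, 0.0]]
```

Both functions return the same value for any `v` and `t`; only `cond_prob_bruteforce` has a cost that grows exponentially with the number of items.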

E. Relation to Several Existing Models 

We now briefly introduce the relation of our $L_1$, $L_2$ and
$L_k$ formulation to several typical existing models for
co-occurrence data modeling.

Log-Bilinear (LBL) Embedding Model: Mnih and Hinton
introduce a neural language model which uses a
log-bilinear energy function to model word contexts. In the
log-bilinear model, the posterior probability of a word given
the context words is given by^3

$$ \log p_\theta(t \mid v) \propto \phi_t^\top \sum_{i \in I_v} \phi_i = \phi_t^\top \Phi^\top v \tag{12} $$

where $\phi_t$ is the vector representation of word $t$, and $\Phi$ is the
word embedding lookup table whose $t$-th row is $\phi_t$. As we can
see, the formulation (12) is a linear function of $v$ for any
given $t$. Thus, the LBL embedding model can, to some extent, be
interpreted as an $L_2$ dependence model.

Matrix Factorization: Matrix factorization based
approaches are probably the most common latent embedding
models for co-occurrence data. The maximum margin matrix
factorization (MMMF) model [24] learns the latent embedding
of items based on the following objective function:

$$ (\Phi, Z) = \arg\min_{\Phi, Z} \sum_{v^i \in V} \left( \| v^i - \Phi z^i \|^2 + \beta \| z^i \|^2 \right) \tag{13} $$

where $v^i$ denotes the $i$-th data sample for training, $z^i$ is the
latent representation of the $i$-th sample, and $\Phi$ is the item
embedding matrix.

When predicting the scores of missing items from an
observation $v$, MMMF first estimates the hidden vector via

$$ \hat{z} = \arg\min_{z} \| v - \Phi z \|^2 \tag{14} $$

and then the score function of the missing item $t$ given $v$ can
be computed as

$$ S(t; v) = \phi_t^\top \hat{z} \tag{15} $$

where $\phi_t$ is the vector representation of item $t$, i.e., the
$t$-th row of the matrix $\Phi$. The formulation (15) is very similar to
that of (12). Therefore, the MMMF model is also related to the $L_2$
dependence model.

Restricted Boltzmann Machine (RBM): The Restricted
Boltzmann Machine is a classical model for modeling data
distributions. Early theoretical studies show that the RBM can
be a universal function approximator: it can learn arbitrary
data distributions if the number of hidden states is exponential in
its input dimension. Typically, an RBM is expressed as

$$ p_\theta(v) = \sum_{h} p_\theta(v, h) = \frac{1}{Z} \sum_{h} e^{-E_\theta(v, h)} \tag{16} $$

By summing out the hidden variables in (16), we obtain
its energy function on $v$:

$$ E_\theta^{RBM}(v) = -b^\top v - \sum_{h=1}^{H} \ln\left(1 + e^{c_h + W_h v}\right) \tag{17} $$

where $H$ is the number of hidden states, $b$ and $c_h$ are the visible
and hidden biases, $W_h$ is the $h$-th row of the weight matrix, and
the term $\ln(1 + e^x)$ is the soft-plus function, which can be
considered a smoothed rectifier. It has been proved that the
soft-plus function can approximate any high-order boolean function.
By allowing the number of hidden states to be exponential in the
number of items, the energy function in (17) can approximate the
$L_k$ energy function. Therefore, the RBM can be interpreted as
an $L_k$ dependence model.

^3 The original Log-Bilinear model contains the transform matrix $C$ for
modeling word position information, i.e., the conditional probability for the
next word is $\log p_\theta(t \mid v) \propto \phi_t^\top \sum_i C_i \phi_i$. We remove the
transform matrix $C$ since the position information is assumed to be not
available in co-occurrence data.

IV. Deep Embedding Model


In this section, we present the Deep Embedding Model
(DEM) for co-occurrence data modeling. As discussed in
the previous section, many classical embedding models only
capture $L_2$ dependency, and the RBM, although an $L_k$
model, has its capacity bounded by the number of hidden
states. Motivated by this observation, we propose a
deep hierarchically structured model that is able to capture
the low-order item dependency at the bottom layer and the
high-order dependency at the top layer. As discussed
in Section III-D, our objective is to learn an energy model
that allows us to perform satisfactory prediction using its
associated conditional probability $p_\theta(v_t = 1 \mid v_{(-t)})$ instead of
the original $p_\theta(v)$. Recall from (8) that the conditional
probability is determined only by the dynamic energy function
$F_\theta(t, v)$ and is independent of the partition function. We
first propose a deep hierarchical energy model by giving its
dynamic energy function, and then show how to learn the deep
model efficiently.


The dynamic energy function for the deep embedding
model is given by

$$ F_\theta^{DEM}(t, v) = b_t + \sum_{i \in I_v} R^0_{it} + (R^1_t)^\top h_1 + (R^2_t)^\top h_2 + \cdots + (R^k_t)^\top h_k \tag{18} $$

where $\{h_1, h_2, \ldots, h_k\}$ are the hidden variables computed
according to a feed-forward multi-layer neural network:

$$ h_1 = \sigma(W^1 v + B^1) \tag{19} $$

$$ h_i = \sigma(W^i h_{i-1} + B^i), \quad i = 2, \ldots, k \tag{20} $$

where $\sigma(\cdot)$ is the logistic (sigmoid) function and
$\{(W^i, B^i)\}_{i=1}^{k}$ are the model weights of the multi-layer
neural network. In the expression (18), there is a set of
hierarchically structured embedding vectors $\{R^0_t, R^1_t, \ldots, R^k_t\}$
assigned to each item $t$, where the inner product between the
hidden variables $h_i$ and $R^i_t$ can approximate any weighted
high-order boolean function on $v$. Therefore, the proposed
DEM can capture $L_k$ dependency. Notice that we also keep
the terms corresponding to the $L_1$ and $L_2$ dependency in the
dynamic energy function of DEM, which makes the model
more adaptive to different data distributions.
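To make the structure of (18)-(20) concrete, the following pure-Python sketch evaluates the dynamic energy for a single item $t$ and input $v$. It is a minimal illustration under stated assumptions rather than the authors' implementation: the parameter container (`b`, `R0`, `W`, `B`, `R`), the helper names, and the small random initialization are all hypothetical choices.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(m, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in m]


def hidden_layers(v, weights, biases):
    """Hidden activations h_1, ..., h_k of (19)-(20)."""
    hs, h = [], v
    for w, bias in zip(weights, biases):
        h = [sigmoid(a + bi) for a, bi in zip(matvec(w, h), bias)]
        hs.append(h)
    return hs


def dynamic_energy(t, v, params):
    """F(t, v) of (18): item bias, pairwise term, one inner product per layer."""
    f = params["b"][t]
    f += sum(params["R0"][i][t] for i, vi in enumerate(v) if vi)
    hs = hidden_layers(v, params["W"], params["B"])
    for r, h in zip(params["R"], hs):
        f += sum(rj * hj for rj, hj in zip(r[t], h))
    return f


def init_params(n_items, layer_sizes, seed=0):
    """Small random parameters (an arbitrary initialization for the demo)."""
    rng = random.Random(seed)
    sizes = [n_items] + list(layer_sizes)
    k = len(layer_sizes)

    def u():
        return rng.uniform(-0.1, 0.1)

    return {
        "b": [u() for _ in range(n_items)],
        "R0": [[u() for _ in range(n_items)] for _ in range(n_items)],
        "W": [[[u() for _ in range(sizes[l])] for _ in range(sizes[l + 1])]
              for l in range(k)],
        "B": [[u() for _ in range(sizes[l + 1])] for l in range(k)],
        "R": [[[u() for _ in range(sizes[l + 1])] for _ in range(n_items)]
              for l in range(k)],
    }
```

For example, `dynamic_energy(1, [1, 0, 1, 0, 0], init_params(5, [4, 3]))` evaluates the function for item 1 given that items 0 and 2 are present. The hidden layers depend only on `v`, so they can be computed once and reused when scoring many candidate items.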

As discussed earlier, due to the difficulty of handling
the partition function in the energy model and the fact that we
only need a conditional probability in our prediction tasks, we
avoid the use of the traditional maximum likelihood principle
for modeling the co-occurrence data, which seeks to solve
the following optimization problem:

$$ \theta^* = \arg\max_\theta \sum_{v \in V} \ln p_\theta(v) \tag{21} $$

For the same reason, we also present our deep embedding
model by giving its dynamic energy function directly, which
can be used for computing the conditional probability easily.
Furthermore, we use an alternative approach,
the maximum pseudo-likelihood principle, for learning
the model parameters of DEM, which seeks to maximize the
conditional probability function:


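$$ \theta^* = \arg\max_\theta \sum_{v \in V} \sum_{t=1}^{N} \ln p_\theta(v_t \mid v_{(-t)}) \tag{22} $$

where $v_{(-t)}$ is the data sample $v$ with the $t$-th entry missing^4
and $v_t \in \{0,1\}$ is the $t$-th entry of $v$. By substituting (8)
into (22), we obtain

$$ \begin{aligned}
\theta^* &= \arg\max_\theta \sum_{v \in V} \sum_{t=1}^{N} \ln \sigma\big((1 - 2 v_t)\, F_\theta(t, v_{(-t)})\big) \\
&= \arg\max_\theta \sum_{v \in V} \sum_{t=1}^{N} \ln \sigma\big(E_\theta(v^{(t)}) - E_\theta(v)\big)
\end{aligned} \tag{23} $$

where $v^{(t)}$ is the neighbor of $v$ with the $t$-th entry flipped
from $v_t$ and all other entries equal to those of $v$, so that it has
unit Hamming distance from $v$; the factor $(1 - 2 v_t)$ selects
$\sigma(-F_\theta)$ when $v_t = 1$ and $\sigma(F_\theta)$ when $v_t = 0$. Note that
expression (23) can be further written as

$$ \theta^* = \arg\max_\theta \sum_{v \in V} \left[ \sum_{t \in I_v} \ln \sigma\big(-F_\theta(t, v_{(-t)})\big) + \sum_{t \notin I_v} \ln \sigma\big(F_\theta(t, v)\big) \right] \tag{24} $$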


From (24) and Figure 1, we note that the maximum
pseudo-likelihood optimization of our deep energy model is equivalent
to training a deep feed-forward neural network with the following
special structure:

• The nonlinearity of the hidden units is the sigmoid
function.

• The output units are fully connected to all the hidden
units.

• The nonlinearity of the output units is also the sigmoid
function.

Fig. 1. Illustration of the computation of the dynamic energy function in
Deep Embedding Model, where the red cross means a missing item.

^4 In the following sections, $v$ and $v_{(-t)}$ are different. They could be equal
to each other when $v_t = 0$.

Furthermore, the training method performs
back-propagation over this special deep neural network (DNN).
However, the method differs from the traditional
back-propagation method in its choice of the supervision signal.
The traditional back-propagation method usually uses human
labeled targets as its supervision signal. In co-occurrence
data modeling, there is no such supervision signal. Instead, we
use a special sampling strategy to create an artificial
supervision signal by flipping the input data at one element for each
sample, and the algorithm performs discriminative training
for this unsupervised learning problem. In other words, the
maximum pseudo-likelihood learning strategy for our proposed
deep energy model is equivalent to discriminatively training the
special DNN in Figure 1 using back-propagation and a special
sampling strategy. In the next subsection, we will explain the
details of the training algorithm.
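The sampling strategy described above can be sketched as a function that turns one unlabeled co-occurrence vector into labeled training examples. This is an illustrative sketch under assumptions: the function name, the `(input, target item, label)` triple format, and drawing negative items without replacement are choices made here and are not specified by the text.

```python
import random


def make_training_examples(v, num_negatives, rng):
    """Turn one binary vector into (input, target item, label) triples."""
    present = [i for i, x in enumerate(v) if x == 1]
    absent = [i for i, x in enumerate(v) if x == 0]
    examples = []
    # Positive examples: hide one observed item and ask the model to recover it.
    for t in present:
        v_minus = list(v)
        v_minus[t] = 0
        examples.append((v_minus, t, 1))
    # Negative examples: a few sampled items that did not occur in v.
    # Assumption: sampled without replacement, capped by the number available.
    for t in rng.sample(absent, min(num_negatives, len(absent))):
        examples.append((list(v), t, 0))
    return examples


# Toy usage: items 0 and 2 occur among N = 6 items; draw 2 negatives.
examples = make_training_examples([1, 0, 1, 0, 0, 0], 2, random.Random(0))
```

Each triple can then be fed to a discriminative learner: the network receives the (possibly corrupted) input vector, and the label states whether the target item truly belongs to the sample.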


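A. Model Parameter Estimation

To maximize the objective function of DEM in (24), we
apply the stochastic gradient descent method to update the model
parameters for each data sample $v$. We omit the details of the
gradient derivation from the objective function. The following
update rules are applied.

First, randomly select an element $v_t$ from $v$: if $t \in I_v$, then
$v_t = 1$; otherwise $v_t = 0$. Second, compute $A(v, t)$:

$$ A(v, t) = \begin{cases} -\sigma\big(F_\theta(t, v_{(-t)})\big), & \text{if } t \in I_v \\ 1 - \sigma\big(F_\theta(t, v)\big), & \text{otherwise.} \end{cases} $$

Update $b$:

$$ \Delta b_t = A(v, t) \tag{25} $$

Update $R^0$:

$$ \Delta R^0_{it} = A(v, t), \quad i \in I_{\tilde{v}} \tag{26} $$

Update $R^i_t$, $W^i$ and $B^i$:

$$ \Delta R^i_t = A(v, t)\, h_i \tag{27} $$

$$ \Delta W^i = \big( (A(v, t) R^i_t + L_i) \circ h_i \circ (1 - h_i) \big)\, h_{i-1}^\top, \quad h_0 = \tilde{v} \tag{28} $$

$$ \Delta B^i = (A(v, t) R^i_t + L_i) \circ h_i \circ (1 - h_i) \tag{29} $$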

















where $\tilde{v}$ denotes $v_{(-t)}$ if $t \in I_v$ and $v$ otherwise, and $\{L_i\}$
are the error terms back-propagated to layer $i$ from the layers above it.

In the implementation, we do not enumerate
all $t \notin I_v$, but sample a fixed number ($T$) of them to
speed up the training process. Algorithm 1 describes the details of
applying the stochastic gradient descent method to train the
Deep Embedding Model.

Algorithm 1 SGD for training Deep Embedding Model

Input: Data v, DEM model, negative sample number T
Output: Updated DEM model
for each t in I_v do
    Calculate the dynamic energy function F_θ(t, v_(-t)) in (18)
    Calculate A(v, t)
    Update model parameters by (25)-(29)
end for
for i = 1 to T do
    Randomly select t not in I_v
    Calculate the dynamic energy function F_θ(t, v) in (18)
    Calculate A(v, t) = 1 - σ(F_θ(t, v))
    Update model parameters by (25)-(29)
end for

V. Experiment

In this section, we empirically validate the effectiveness of the Deep
Embedding Model (DEM) on several real-world
datasets. The datasets are categorized into three domains: social
networks, product co-purchasing data and online rating
data. We first introduce the details of our experiment datasets.

Social networks: The social networks are collected from
two sources: Epinion^5 and Slashdot^6. Both are directed
graphs. Each user in a social network has a unique uid. The
social connections of a user are represented as a binary sparse
vector, which contains the friend co-occurrence information. In the
experiments, users in the social network datasets are divided into
five cross-validation folds; each fold contains 80 percent of the
users for training and 20 percent for testing. For each user in the
test set, one of her/his connections to others is randomly removed.
The statistical models then predict the missing edge from
the existing connections. The Epinion dataset contains 75,879
users and 508,837 connections; the Slashdot dataset contains
77,360 users and 905,468 connections.

Product Co-Purchasing Data: The product co-purchasing
dataset is collected from an anonymous Belgian retail store^7.
The transaction sets are divided into five cross-validation folds;
each fold contains 80 percent of the data for training and 20 percent
for testing. For each transaction record in the test set, one item
is randomly removed from the list. Model performance
is measured by the number of missing items that are correctly
recovered. The Retail dataset contains 15,664 unique
items, 87,163 transaction records and 638,302 purchased
items.

Online Rating Data: The online rating data consists of two
datasets, MovieLen10M and Jester. MovieLen10M^8 is a
movie rating dataset with ratings ranging from 1 to 5. Jester^9
is an online joke recommender system, in which users rate jokes
with continuous ratings ranging from -10 to 10. Users in the rating
datasets can be represented as sparse rating vectors. In our
experimental setting, we transform the real-valued ratings into
binary values by applying a rating threshold: in the MovieLen10M
dataset, ratings equal to or larger than four are treated as one and
all others as zero; in Jester, the rating threshold is zero.
The Jester dataset contains 101 unique jokes, 24,944 users, and
756,148 ratings above zero. MovieLen10M contains 10,104
unique movies, 69,765 users and 3,507,735 ratings equal to or
larger than four. Both datasets are divided into five
cross-validation folds; each fold contains 80 percent of the users
for training and 20 percent for testing.

^5 Epinion network http://snap.stanford.edu/data/soc-Epinions1.html
^6 Slashdot network https://snap.stanford.edu/data/soc-Slashdot0811.html
^7 Retail dataset http://fimi.ua.ac.be/data/retail.data

A. Evaluation 

In the experimental study, we use the missing item
prediction task to evaluate model performance. All the
datasets are divided into five cross-validation folds. Each record
in the test sets is represented as a binary sparse vector with one
of its nonzero elements removed. For each test record $v$, we use
$t_v$ to denote the ground-truth index of the missing item, and
$P_K(v)$ to denote the predicted top-$K$ item index list. Top-K
accuracy is used as the main evaluation metric in the experiments.
It is formally defined as follows:

$$ \text{Top@}K\ \text{Acc} = \frac{1}{|T|} \sum_{v \in T} I\big(t_v \in P_K(v)\big) \tag{30} $$

where $T$ is the whole test set and $I(x)$ is the boolean indicator
function: if $x$ is true, $I(x) = 1$; otherwise $I(x) = 0$. In the
experiments, Top@1 Acc and Top@10 Acc are the two key
indicators for model comparison.
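The metric in (30) amounts to counting how often the held-out item appears among the first $K$ predictions. A minimal sketch, with a hypothetical function name and input format (one ground-truth index and one ranked prediction list per test record):

```python
def top_k_accuracy(ground_truth, ranked_predictions, k):
    """Fraction of test records whose missing item is in the top-k list."""
    hits = sum(1 for t, ranked in zip(ground_truth, ranked_predictions)
               if t in ranked[:k])
    return hits / len(ground_truth)


# Toy usage: three test records with made-up rankings.
truth = [2, 0, 5]
ranked = [[2, 1, 3], [4, 0, 1], [1, 2, 3]]
acc_at_1 = top_k_accuracy(truth, ranked, k=1)
acc_at_2 = top_k_accuracy(truth, ranked, k=2)
```

Here only the first record is a hit at $K = 1$, and the second record becomes a hit at $K = 2$.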

B. Experiment Results 

In this section, we report the performance of the proposed
Deep Embedding Model (DEM) compared with state-of-the-art
baselines. Specifically, the following baseline methods
are compared:

Co-Visiting Graph (CVG): The Co-Visiting Graph method
computes the item co-occurrence graph, where the weight of the
edge between two items is the number of times the items
co-occur. In the prediction phase, CVG scores a candidate item by
summing the weights of all edges that link it to the existing items.

Normalized CVG (Norm CVG): Norm CVG is a variant
of the CVG method, in which the edge weight is
normalized by the frequency of items.
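The two co-visiting baselines can be sketched in a few lines. This is an illustrative reading of the descriptions above rather than the exact baseline implementation: the function names are hypothetical, and because the text does not specify the normalization, the sketch assumes each edge weight is divided by the square root of the product of the two item frequencies.

```python
import math
from collections import Counter
from itertools import combinations


def build_cvg(transactions):
    """Count pairwise co-occurrences and per-item frequencies."""
    edges, freq = Counter(), Counter()
    for items in transactions:
        unique = sorted(set(items))
        freq.update(unique)
        for i, j in combinations(unique, 2):
            edges[(i, j)] += 1
            edges[(j, i)] += 1
    return edges, freq


def cvg_score(candidate, existing, edges):
    """CVG: sum of co-occurrence counts between candidate and existing items."""
    return float(sum(edges.get((i, candidate), 0) for i in existing))


def norm_cvg_score(candidate, existing, edges, freq):
    """Norm CVG under the assumed normalization count / sqrt(f_i * f_candidate)."""
    score = 0.0
    for i in existing:
        count = edges.get((i, candidate), 0)
        if count:
            score += count / math.sqrt(freq[i] * freq[candidate])
    return score


# Toy usage with three made-up transactions.
edges, freq = build_cvg([["a", "b"], ["a", "b", "c"], ["b", "c"]])
raw = cvg_score("c", ["a", "b"], edges)
normalized = norm_cvg_score("c", ["a", "b"], edges, freq)
```

Ranking all candidate items by either score and keeping the top $K$ yields the prediction list used for evaluation.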

Local Random Walk (LRW) [29]: LRW computes the
similarity between a pair of items by simulating the probability
that a random walker starting from the initial item reaches the
target item. The LRW method performs the random walk
on the co-visiting graph, which can alleviate the
sparsity problem in the graph. In the experiments, the number
of steps in the random walk is varied from 1 to 4;
the reported results are based on the parameter configuration
that produces the best results.

^8 MovieLen10M http://grouplens.org/datasets/movielens/
^9 Jester dataset www.ieor.berkeley.edu/~goldberg/jester-data/

Latent Dirichlet Allocation (LDA) [13]: The LDA model can
be viewed as a variant of the matrix factorization approach,
in which the item co-occurrence information is assumed to be
generated by latent topics. In the prediction phase, LDA
estimates the latent topic distribution given the observed
items and generates the most probable missing items according
to that topic distribution. The number of topics in the LDA
model is varied from 32 to 512 in the experiments.

Restricted Boltzmann Machine (RBM) [23]: RBM is a general
density estimation model, which can be naturally used for the
missing item prediction task. In the experiments, the number
of hidden states in the RBM model is varied from 32 to 512.

LogBilinear (LBL) Model: LBL was first proposed for the
language modeling task. In our experiments, item position
information is not available. Therefore, we implement a
simpler version of the LBL model by removing the position
variables. The embedding dimension for LBL is varied from 32
to 512 in the experiments.

Fully Visible Boltzmann Machine (FVBM): FVBM is a type of
Markov Random Field model, as described in Section 2.

Denoising AutoEncoder (DAE).

Deep Embedding Model (DEM): DEM can be configured with
different numbers of hidden layers and different numbers of
hidden states. In the experiments, the number of hidden states
is varied from 8 to 512, and the number of hidden layers from
1 to 3.
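The graph-based baselines in the list above (CVG, NormCVG and LRW) can be illustrated with a short sketch. This is not the authors' implementation: the function names are hypothetical, NormCVG is assumed to divide each edge weight by the frequency of the candidate item, and LRW is assumed to start the walker with uniform probability over the observed items and to aggregate the probability mass that reaches each candidate after a fixed number of steps.

```python
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple

Graph = Dict[int, Dict[int, float]]


def build_cvg(records: Iterable[Iterable[int]]) -> Tuple[Graph, Dict[int, int]]:
    """Build the co-visiting graph: edge weight = number of co-occurrences."""
    weight: Graph = {}
    freq: Dict[int, int] = {}
    for record in records:
        items = sorted(set(record))
        for i in items:
            freq[i] = freq.get(i, 0) + 1
            weight.setdefault(i, {})
        for i, j in combinations(items, 2):
            weight[i][j] = weight[i].get(j, 0.0) + 1.0
            weight[j][i] = weight[j].get(i, 0.0) + 1.0
    return weight, freq


def cvg_scores(weight: Graph, freq: Dict[int, int], observed: Set[int],
               normalize: bool = False) -> Dict[int, float]:
    """Score candidates by summing edge weights from the observed items.

    With normalize=True each weight is divided by the candidate's frequency
    (one possible reading of NormCVG; an assumption).
    """
    scores: Dict[int, float] = {}
    for i in observed:
        for j, w in weight.get(i, {}).items():
            if j in observed:
                continue
            scores[j] = scores.get(j, 0.0) + (w / freq[j] if normalize else w)
    return scores


def lrw_scores(weight: Graph, observed: Set[int], steps: int) -> Dict[int, float]:
    """Probability mass reaching each candidate after `steps` random-walk steps."""
    if not observed:
        return {}
    # Assumption: the walker starts uniformly at random on an observed item.
    mass = {i: 1.0 / len(observed) for i in observed}
    for _ in range(steps):
        nxt: Dict[int, float] = {}
        for i, m in mass.items():
            nbrs = weight.get(i, {})
            total = sum(nbrs.values())
            if total == 0.0:
                continue  # mass on isolated items is dropped
            for j, w in nbrs.items():
                nxt[j] = nxt.get(j, 0.0) + m * w / total
        mass = nxt
    return {j: m for j, m in mass.items() if j not in observed}


toy_records = [[0, 1, 2], [0, 1], [1, 2]]
graph, item_freq = build_cvg(toy_records)
print(cvg_scores(graph, item_freq, {0}))        # {1: 2.0, 2: 1.0}
print(cvg_scores(graph, item_freq, {0}, True))  # item 1 -> 2/3, item 2 -> 1/2
print(lrw_scores(graph, {0}, steps=1))          # item 1 -> 2/3, item 2 -> 1/3
```

In the toy example, normalization narrows the gap between items 1 and 2 because item 1 is more frequent overall, which is the kind of popularity correction that NormCVG is meant to provide.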

The experimental environment is a machine with Intel Xeon
2.60 GHz CPUs (2 processors) and four Tesla K40m GPUs. Except
for the CVG, NormCVG, LRW and LDA methods, all approaches run
on GPU.

Table I provides a detailed comparison of the nine approaches
in terms of Top@1 and Top@10 prediction accuracy on the
MovieLen10M (Top500 Movies) and Jester datasets. The proposed
DEM shows significant improvements over all baselines on the
MovieLen10M (Top500) dataset. On the Jester dataset, DEM
significantly outperforms all baselines except LBL and FVBM.
The reported running time includes both training time and
prediction time.

Table II shows the experimental results on the MovieLen10M
(full) and Retail datasets. As the table shows, DEM achieves
significantly better results than all baseline methods except
FVBM on the MovieLen10M dataset; on the Retail dataset, DEM
outperforms all baselines except LBL. Compared with the other
baselines, LBL and FVBM obtain relatively stable results on
all four datasets. This suggests that Bayesian bi-dependence
models can largely approximate the true data distribution in
some real-world applications. However, the fact that DEM
consistently outperforms both FVBM and LBL shows that
incorporating the high-order dependence terms typically yields
better results.


Table III compares all nine approaches on the social network
datasets. The table shows that the heuristic methods CVG,
NormCVG and LRW outperform most of the advanced algorithms on
the Epinions and Slashdot datasets. One possible reason is
that the number of tail users in social networks is much
larger than in the online rating and product co-purchasing
datasets. Only the DAE and DEM models outperform the
heuristic-based models on the Epinions and Slashdot datasets,
and DEM significantly beats all baselines.

Fig. 2. Hyper-Parameters Selection for Deep Embedding Model on
MovieLen1M Data

An interesting result from the experiments is that the $L_k$
dependence model RBM does not perform better than the $L_2$
dependence models (i.e., LBL and FVBM). Many factors can
affect model performance on different datasets, e.g., the
local optimization algorithm and hyperparameter selection.
Almost all statistical models are biased towards or against
some data distributions. For experimental datasets from
multiple domains, it is impractical to assume that the data
are generated under a single distributional assumption.
Therefore, DEM adopts a bottom-up scheme that gradually learns
the data distribution, from low-order dependence assumptions
to high-order dependence assumptions.

C. An Analysis of Model Hyperparameters 

In this subsection, we empirically analyze the hyperparameters
of DEM. We use the MovieLen1M dataset to show how model
performance varies with the choice of hyperparameters. Figure 2
compares the Top@1 accuracy under different hyperparameter
settings. DEM-0 denotes the deep embedding model with no
hidden layers and is equivalent to FVBM. DEM-8, DEM-16, DEM-32
and DEM-64 denote models with a single hidden layer of 8, 16,
32 and 64 hidden states, respectively. Likewise, DEM-32 x 16
denotes a model with two hidden layers of 32 and 16 hidden
states, respectively. Figure 2 shows that DEM-32 x 16 achieves
the best performance among the DEM settings compared. However,
its improvement over the other settings is not significant.
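The naming convention used in this subsection can be made concrete with a small helper that maps a configuration name to its list of hidden-layer sizes. This is only an illustrative sketch; the function name `hidden_layer_sizes` is hypothetical and not part of the model.

```python
from typing import List


def hidden_layer_sizes(name: str) -> List[int]:
    """Map a name such as 'DEM-32 x 16' to its hidden-layer sizes [32, 16].

    'DEM-0' (no hidden layers, i.e. the FVBM case) maps to an empty list.
    """
    prefix, sep, spec = name.partition("-")
    if prefix not in ("DEM", "DEM*") or not sep:
        raise ValueError(f"not a DEM configuration name: {name!r}")
    # Accept 'x', 'X' or the multiplication sign as the layer separator.
    sizes = [int(part) for part in spec.lower().replace("×", "x").split("x")]
    return [s for s in sizes if s > 0]


print(hidden_layer_sizes("DEM-0"))        # []
print(hidden_layer_sizes("DEM-64"))       # [64]
print(hidden_layer_sizes("DEM-32 x 16"))  # [32, 16]
```

Representing a configuration as a list of layer sizes makes it straightforward to enumerate the settings compared in Figure 2.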

D. Learning Representations 

The DEM provides a unified framework that jointly trains the
$L_1$ dependence (bias) term, the $L_2$ dependence term and
the $L_k$ dependence (hierarchical latent embedding) term for
dynamic energy function estimation. In the deep learning area,
it has been proposed that the hidden semantic representations
of items can be extracted from unsupervised data signals, i.e.,






















TABLE I. TOP@1 AND TOP@10 PREDICTION ACCURACY ON MOVIELEN10M (500) AND JESTER (101) DATASETS. SUPERSCRIPTS α, β, γ AND δ
INDICATE STATISTICALLY SIGNIFICANT IMPROVEMENTS (p < 0.01) OVER DAE, FVBM, LBL AND RBM

            MovieLen10M (Top500 Movies)                      Jester (101 Jokes)
Models      Top@1           Top@10          Run Time         Top@1           Top@10          Run Time
CVG         4.26 ± 0.23     19.56 ± 0.21    ≈ 10 sec         16.59 ± 0.20    59.70 ± 0.34    ≈ 2 sec
NormCVG     4.73 ± 0.23     21.02 ± 0.25    ≈ 10 sec         16.65 ± 0.25    59.98 ± 0.36    ≈ 2 sec
LRW         4.40 ± 0.24     20.28 ± 0.33    ≈ 50 sec         16.57 ± 0.22    59.98 ± 0.39    ≈ 10 sec
LDA         6.70 ± 0.35     30.11 ± 0.41    ≈ 1000 sec       15.88 ± 0.30    62.88 ± 0.41    ≈ 100 sec
RBM         10.52 ± 0.33    39.40 ± 0.63    ≈ 500 sec        19.66 ± 0.29    68.95 ± 0.67    ≈ 50 sec
LBL         10.42 ± 0.33    38.49 ± 0.44    ≈ 300 sec        20.21 ± 0.27    69.46 ± 0.59    ≈ 10 sec
FVBM        10.77 ± 0.35    39.34 ± 0.48    ≈ 400 sec        20.35 ± 0.28    69.18 ± 0.42    ≈ 10 sec
DAE         10.50 ± 0.34    39.41 ± 0.47    ≈ 200 sec        19.35 ± 0.35    68.16 ± 0.77    ≈ 10 sec
DEM         11.32 ± 0.42    41.33 ± 0.75    ≈ 400 sec        20.56 ± 0.23    69.46 ± 0.66    ≈ 10 sec

TABLE II. TOP@1 AND TOP@10 PREDICTION ACCURACY ON MOVIELEN10M (10,269) AND RETAIL (16,469) DATASETS. SUPERSCRIPTS α, β, γ AND δ
INDICATE STATISTICALLY SIGNIFICANT IMPROVEMENTS (p < 0.01) OVER DAE, FVBM, LBL AND RBM

            MovieLen10M (10,269 Movies)                      Retail (16,469 Items)
Models      Top@1           Top@10          Run Time         Top@1           Top@10          Run Time
CVG         3.24 ± 0.12     14.34 ± 0.15    ≈ 10 sec         13.48 ± 0.17    25.30 ± 0.27    ≈ 10 sec
NormCVG     3.74 ± 0.13     16.01 ± 0.17    ≈ 10 sec         13.92 ± 0.10    28.01 ± 0.36    ≈ 10 sec
LRW         3.54 ± 0.14     15.51 ± 0.08    ≈ 1800 sec       13.92 ± 0.08    27.56 ± 0.34    ≈ 200 sec
LDA         4.15 ± 0.13     18.95 ± 0.06    ≈ 2600 sec       13.30 ± 0.16    24.76 ± 0.32    ≈ 1300 sec
RBM         4.69 ± 0.17     20.80 ± 0.02    ≈ 1800 sec       12.74 ± 0.29    23.72 ± 0.37    ≈ 800 sec
LBL         6.67 ± 0.12     26.45 ± 0.33    ≈ 700 sec        15.05 ± 0.20    26.00 ± 0.26    ≈ 300 sec
FVBM        7.60 ± 0.35     29.61 ± 0.43    ≈ 1200 sec       14.38 ± 0.12    27.48 ± 0.34    ≈ 400 sec
DAE         5.41 ± 0.18     23.80 ± 0.43    ≈ 900 sec        13.13 ± 0.26    25.04 ± 0.32    ≈ 400 sec
DEM         7.77 ± 0.19     30.01 ± 0.86    ≈ 1600 sec       15.49 ± 0.23    28.45 ± 1.47    ≈ 400 sec


TABLE III. TOP@1 AND TOP@10 PREDICTION ACCURACY ON EPINIONS (75,879) AND SLASHDOT (77,360) DATASETS. SUPERSCRIPTS α, β, γ AND δ
INDICATE STATISTICALLY SIGNIFICANT IMPROVEMENTS (p < 0.01) OVER DAE, FVBM, LBL AND RBM

            Epinions (75,879 Users)                          Slashdot (77,360 Users)
Models      Top@1           Top@10          Run Time         Top@1           Top@10          Run Time
CVG         4.04 ± 0.45     12.76 ± 0.44    ≈ 30 sec         2.12 ± 0.22     6.80 ± 0.35     ≈ 30 sec
NormCVG     5.01 ± 0.50     15.53 ± 0.57    ≈ 30 sec         2.65 ± 0.21     7.64 ± 0.38     ≈ 30 sec
LRW         5.17 ± 0.43     15.91 ± 0.51    ≈ 1500 sec       2.62 ± 0.23     7.65 ± 0.41     ≈ 2000 sec
LDA         1.41 ± 0.07     6.50 ± 0.61     ≈ 5100 sec       0.96 ± 0.14     4.49 ± 0.52     ≈ 8500 sec
RBM         2.45 ± 0.15     11.49 ± 0.59    ≈ 3300 sec       1.81 ± 0.24     5.98 ± 0.57     ≈ 5800 sec
LBL         3.82 ± 0.24     12.72 ± 0.60    ≈ 1500 sec       2.39 ± 0.39     6.77 ± 0.18     ≈ 2200 sec
FVBM        2.93 ± 0.31     12.79 ± 0.39    ≈ 3500 sec       2.10 ± 0.41     6.84 ± 0.25     ≈ 4300 sec
DAE         5.29 ± 0.44     17.13 ± 0.62    ≈ 3200 sec       3.32 ± 0.53     9.18 ± 0.21     ≈ 5800 sec
DEM         6.09 ± 0.44     18.83 ± 0.56    ≈ 3500 sec       3.75 ± 0.15     9.42 ± 0.49     ≈ 4400 sec


co-occurrence data. The extracted latent semantic vectors can
be used as semantic features for later item classification and
clustering tasks. Therefore, to make the DEM learn the item
semantic vectors from co-occurrence data, we disable the $L_1$
and $L_2$ dependence terms and keep only the hidden neural
layers. Table IV presents the experimental results of DEM*
with different architectures. Among them, DEM*-64 x 64
achieves the best result, with a Top@1 accuracy of 11.02,
which is close to the best result of 11.32 obtained by
DEM-32 x 16. We concatenate the hierarchically structured
embedding vectors $\{R_t^1, R_t^2, R_t^3\}$ into a single
semantic vector $R_t$ to represent item $t$. With the
DEM*-64 x 64 model, we obtain a 128-dimensional semantic
vector for each movie. By projecting the 128-dimensional
vectors onto a 2D map, we obtain the movie visualization in
Figure 3, which contains the 500 most frequent movies. As the
figure shows, similar movies usually lie closer together than
dissimilar ones. We also annotate some informative movies in
the graph. Several movie series are discovered and grouped
together, e.g., the Star Trek series and the Wallace and
Gromit series.

^DEM* is a simplified DEM which removes the $L_1$ and $L_2$
terms in the dynamic energy function.
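The concatenation step in this subsection can be sketched as follows. The snippet assumes that each hidden layer exposes a lookup table from item to embedding vector, which is a simplification, and it uses cosine similarity as one possible way to check that related items end up close together; the paper itself relies on a t-SNE projection for that purpose. All names and toy values are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def semantic_vector(layer_embeddings: Sequence[Dict[str, List[float]]],
                    item: str) -> List[float]:
    """Concatenate the item's embedding from every hidden layer into one vector."""
    vec: List[float] = []
    for layer in layer_embeddings:
        vec.extend(layer[item])
    return vec


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy per-layer embeddings for three items (made-up values).
layer1 = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0]}
layer2 = {"A": [0.5], "B": [0.4], "C": [-0.5]}

vec_a = semantic_vector([layer1, layer2], "A")
vec_b = semantic_vector([layer1, layer2], "B")
vec_c = semantic_vector([layer1, layer2], "C")
print(vec_a)                                         # [1.0, 0.0, 0.5]
print(cosine(vec_a, vec_b) > cosine(vec_a, vec_c))   # True
```

With two hidden layers of 64 states each, the same concatenation yields the 128-dimensional vectors mentioned in the text.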

VI. Conclusion and Future Work 

In this paper, we introduce a general Bayesian framework for
co-occurrence data modeling. Based on this framework, we study
several previous machine learning models, e.g., the Fully
Visible Boltzmann Machine, the Restricted Boltzmann Machine
and Maximum Margin Matrix Factorization, which


^We use the t-SNE [26] visualization tool to obtain the movie
visualization figure.

Fig. 3. DEM*-64 x 64 visualization of 500 movies on the MovieLen1M dataset (t-SNE visualization tool).















































TABLE IV. TOP@1 AND TOP@10 ACCURACY ON MOVIELEN DATA SET

                       MovieLen10M (Top500)
Models                 Top@1           Top@10
DEM*-16                1.93 ± 0.15     16.65 ± 0.26
DEM*-16 x 16           9.11 ± 0.12     36.74 ± 0.16
DEM*-16 x 16 x 16      9.95 ± 0.43     38.65 ± 0.54
DEM*-32                9.49 ± 0.31     38.91 ± 0.36
DEM*-32 x 32           10.26 ± 0.18    40.75 ± 0.14
DEM*-32 x 32 x 32      10.51 ± 0.31    40.91 ± 0.56
DEM*-64                10.35 ± 0.28    40.93 ± 0.41
DEM*-64 x 64           11.02 ± 0.39    41.28 ± 0.39
DEM*-64 x 64 x 64      10.53 ± 0.19    40.84 ± 0.18


can each be interpreted as belonging to one of three
categories according to the $L_1$, $L_2$ and $L_k$
assumptions. Motivated by these three Bayesian dependence
assumptions, we developed a hierarchically structured model,
the DEM. The DEM is a unified model that combines both
low-order and high-order item dependence features: the
low-order features are captured at the bottom layer, and the
high-order features are captured at the top layer. The
experiments demonstrate the effectiveness of DEM, which
significantly outperforms the baseline methods on several
public datasets. In future work, we plan to extend our study
along the following directions: 1) developing a nonparametric
Bayesian model that automatically and efficiently infers the
deep structure from data, to avoid or reduce expensive
hyperparameter sweeping; 2) developing an online algorithm to
learn DEM on streaming co-occurrence data; 3) encoding
frequent item set information using the DEM representation.